[Track A / PR-1] ggml: castle DDTree extensions on top of luce-org TQ3+tree-kernels #4

Draft
Leechael wants to merge 1 commit into base/luce-org-tq3 from track-a/ggml
Leechael commented on May 5, 2026

Tracking: #3 (Phase 1, Track A, ggml layer)

Stack

This is PR-1 in a stacked pair:

Scope

All ggml-layer changes that sit on top of luce-org PR #1 (merge commit 1823460262), which brought in the TQ3_0 KV cache and the tree-mode SSM/GDN kernels.

luce-org-side fixes (already PR'd to luce-org separately as PR #2-#5, included here as a stacked snapshot)

  • 3e80ebc8a fattn-chunked routing fix (fattn-chunked.cu, fattn.cu)
  • c253e49b9 consumer Blackwell sm_120 skip (no FP4 MMA)
  • 6de9f7bb2 cuMem pool extension race fix
  • 07fe012aa turbo_wht parallelization (1 → 128 threads/block)

castle-side extensions (no upstream yet)

  • CPU-side ssm_conv_tree kernel (ggml/src/ggml-cpu/ops.cpp)
  • WITH_PERSIST template branch for ssm_conv_tree (ggml-cuda/ssm-conv.cu)
  • ggml op declarations / registrations (ggml.h, ggml.c)

Stat

5 files changed, 265 insertions, 33 deletions.

Status

Not for merge into luce-org master — this is a fork-side organizational record. Castle's working set assumes this branch is merged on top of the luce-org 1823460262 snapshot.


Summary by cubic

Adds tree-mode SSM conv with per-token persistent state for DDTree/DFS decoding, plus CPU/CUDA support and a new API ggml_ssm_conv_tree_persist. Also makes chunked flash-attention prefill more resilient under low VRAM by auto-reducing chunk size. Part of Linear #3 (Phase 1, Track A).

  • New Features

    • New ggml_ssm_conv_tree_persist(ctx, sx, c, parent_ids, persist_inter) op that writes each token’s (K-1)-element post-state to persist_inter (contiguous F32, shape [K-1, d_inner, n_tokens, n_seqs]).
    • CPU: ssm_conv forward now supports tree mode via parent_ids and optionally emits per-token post-state; gated_delta_net_one_chunk supports parent_ids rollback and external state persistence (F32/F16).
    • CUDA: adds WITH_PERSIST path to tree-mode ssm_conv and plumbs persist_inter through ggml_cuda_op_ssm_conv.
  • Bug Fixes

    • ggml-cuda/fattn-chunked.cu: handle scratch OOM by halving tbq_chunk, freeing failed buffers, and retrying instead of aborting (with a warning).
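
The halve-and-retry behaviour described for the scratch allocation can be sketched roughly as follows. This is a minimal illustration, not the code in `fattn-chunked.cu`: the helper name `alloc_with_backoff`, the `try_alloc` callback, and the `min_chunk` floor are all hypothetical, and the callback is assumed to free any partially allocated buffers before it reports failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>

// Hypothetical sketch of "halve the chunk and retry on scratch OOM".
// try_alloc(chunk) is assumed to return true on success and to release
// any partially allocated buffers itself before returning false.
// Returns the chunk size that finally succeeded, or 0 if even
// min_chunk could not be satisfied.
static size_t alloc_with_backoff(size_t chunk, size_t min_chunk,
                                 const std::function<bool(size_t)> & try_alloc) {
    assert(min_chunk > 0);
    while (chunk >= min_chunk) {
        if (try_alloc(chunk)) {
            return chunk;
        }
        // Warn instead of aborting, then retry with half the chunk size.
        std::fprintf(stderr, "scratch alloc failed for chunk=%zu, retrying with %zu\n",
                     chunk, chunk / 2);
        chunk /= 2;
    }
    return 0;
}
```

A caller would treat a return value of 0 as the only hard failure; any other value is the chunk size to use for the rest of the prefill.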

Written for commit c3692ea. Summary will update on new commits.
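
To make the tree-mode convolution with per-token persistent state more concrete, here is a scalar reference sketch. It is not the ggml kernel and does not use the ggml API: the function name `ssm_conv_tree_ref`, the separate `init_state` and `x` buffers, the single-sequence restriction, and the rule that parents precede their children are assumptions made for illustration. Only the idea that each token's window starts from its parent's post-state, and the `[K-1, d_inner, n_tokens]` persist layout, are taken from the summary above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative reference only (one sequence, F32); names are hypothetical.
//   init_state : [K-1, d_inner]  window contents before any root token
//   x          : [d_inner, n_tokens]  new input value per token
//   w          : [K, d_inner]  conv weights
//   parent_ids : [n_tokens]  parent index, or -1 for a root token;
//                assumed: every parent appears before its children
//   y          : [d_inner, n_tokens]  conv output
//   persist    : [K-1, d_inner, n_tokens]  post-state of each token
static void ssm_conv_tree_ref(int K, int d_inner, int n_tokens,
                              const std::vector<float> & init_state,
                              const std::vector<float> & x,
                              const std::vector<float> & w,
                              const std::vector<int32_t> & parent_ids,
                              std::vector<float> & y,
                              std::vector<float> & persist) {
    assert(K >= 2);
    const int S = K - 1; // state length per channel
    y.assign((size_t) d_inner * n_tokens, 0.0f);
    persist.assign((size_t) S * d_inner * n_tokens, 0.0f);
    std::vector<float> win(K);
    for (int t = 0; t < n_tokens; ++t) {
        const int p = parent_ids[t];
        assert(p < t); // parent already processed (or -1 for a root)
        for (int i = 0; i < d_inner; ++i) {
            // Start from the parent's post-state, not from token t-1.
            const float * prev = p < 0
                ? &init_state[(size_t) i * S]
                : &persist[((size_t) p * d_inner + i) * S];
            for (int k = 0; k < S; ++k) win[k] = prev[k];
            win[S] = x[(size_t) t * d_inner + i];
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += win[k] * w[(size_t) i * K + k];
            y[(size_t) t * d_inner + i] = acc;
            // Post-state: the window shifted by one, i.e. its last K-1 values.
            float * post = &persist[((size_t) t * d_inner + i) * S];
            for (int k = 0; k < S; ++k) post[k] = win[k + 1];
        }
    }
}
```

Because every token's post-state is kept, a decoder that accepts one branch of the tree can resume from the accepted token's slice of `persist` without recomputing the convolution window, which is presumably the motivation for the `persist_inter` output.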

Aggregates all ggml-layer changes that sit on top of luce-org PR #1
(merge 1823460, TQ3_0 KV cache + b16de65 tree-mode SSM/GDN kernels):

luce-org-side fixes (already PR'd separately to luce-org as PR #2-#5):
- 3e80ebc fattn-chunked routing fix (fattn-chunked.cu, fattn.cu)
- c253e49 consumer Blackwell sm_120 skip
- 6de9f7b cuMem pool extension race fix
- 07fe012 turbo_wht parallelization

castle-side extensions (no upstream yet):
- CPU-side ssm_conv tree kernel  (ggml/src/ggml-cpu/ops.cpp)
- WITH_PERSIST template branch for ssm_conv_tree (ssm-conv.cu)
- ggml op declarations / registrations (ggml.h, ggml.c)

Tracking: #3 (Phase 1, Track A, ggml layer)